pacman::p_load(sf, tidyverse, funModeling, blorr, corrplot, ggpubr, spdep,GWmodel, tmap,skimr,caret,report)In class ex 5
Objective
In this exercise, we will be building a logistic regression model for the water point status at Osun state, Nigeria.
Importing Packages
Importing the Analytical Data
Osun <- read_rds('rds/Osun.rds')
Osun_wp_sf <- read_rds('rds/Osun_wp_sf.rds')osun contains the polygon boundaries while Osun_wp_sf contains the water point data in Osun Nigeria.
The rds files have been preprocessed - eg. cleaning up of variables and variable names.
Next, we check the status field of the Osun_wp_sf sf dataframe object. This field is derived from the original status_clean field where
observations that are null are filtered away
remaining values that indicate water points are functional are labelled T
else they are labelled F
Osun_wp_sf %>%
freq(input='status')Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.
ℹ The deprecated feature was likely used in the funModeling package.
Please report the issue at <]8;;https://github.com/pablo14/funModeling/issueshttps://github.com/pablo14/funModeling/issues]8;;>.

status frequency percentage cumulative_perc
1 TRUE 2642 55.5 55.5
2 FALSE 2118 44.5 100.0
tmap_mode("view")tmap mode set to interactive viewing
tm_shape(Osun)+
tmap_options(check.and.fix = TRUE) +
tm_polygons(alpha = 0.4) +
tm_shape(Osun_wp_sf) +
tm_dots(col = "status",
alpha = 0.6) +
tm_view(set.zoom.limits = c(9,12))tmap_mode("plot")tmap mode set to plotting
Exploratory Data Analysis (EDA)
Osun_wp_sf %>%
skim()Warning: Couldn't find skimmers for class: sfc_POINT, sfc; No user-defined `sfl`
provided. Falling back to `character`.
| Name | Piped data |
| Number of rows | 4760 |
| Number of columns | 75 |
| _______________________ | |
| Column type frequency: | |
| character | 47 |
| logical | 5 |
| numeric | 23 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1.00 | 5 | 44 | 0 | 2 | 0 |
| report_date | 0 | 1.00 | 22 | 22 | 0 | 42 | 0 |
| status_id | 0 | 1.00 | 2 | 7 | 0 | 3 | 0 |
| water_source_clean | 0 | 1.00 | 8 | 22 | 0 | 3 | 0 |
| water_source_category | 0 | 1.00 | 4 | 6 | 0 | 2 | 0 |
| water_tech_clean | 24 | 0.99 | 9 | 23 | 0 | 3 | 0 |
| water_tech_category | 24 | 0.99 | 9 | 15 | 0 | 2 | 0 |
| facility_type | 0 | 1.00 | 8 | 8 | 0 | 1 | 0 |
| clean_country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| clean_adm1 | 0 | 1.00 | 3 | 5 | 0 | 5 | 0 |
| clean_adm2 | 0 | 1.00 | 3 | 14 | 0 | 35 | 0 |
| clean_adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| clean_adm4 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| installer | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management_clean | 1573 | 0.67 | 5 | 37 | 0 | 7 | 0 |
| status_clean | 0 | 1.00 | 9 | 32 | 0 | 7 | 0 |
| pay | 0 | 1.00 | 2 | 39 | 0 | 7 | 0 |
| fecal_coliform_presence | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| subjective_quality | 0 | 1.00 | 18 | 20 | 0 | 4 | 0 |
| activity_id | 4757 | 0.00 | 36 | 36 | 0 | 3 | 0 |
| scheme_id | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| wpdx_id | 0 | 1.00 | 12 | 12 | 0 | 4760 | 0 |
| notes | 0 | 1.00 | 2 | 96 | 0 | 3502 | 0 |
| orig_lnk | 4757 | 0.00 | 84 | 84 | 0 | 1 | 0 |
| photo_lnk | 41 | 0.99 | 84 | 84 | 0 | 4719 | 0 |
| country_id | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| data_lnk | 0 | 1.00 | 79 | 96 | 0 | 2 | 0 |
| water_point_history | 0 | 1.00 | 142 | 834 | 0 | 4750 | 0 |
| clean_country_id | 0 | 1.00 | 3 | 3 | 0 | 1 | 0 |
| country_name | 0 | 1.00 | 7 | 7 | 0 | 1 | 0 |
| water_source | 0 | 1.00 | 8 | 30 | 0 | 4 | 0 |
| water_tech | 0 | 1.00 | 5 | 37 | 0 | 20 | 0 |
| adm2 | 0 | 1.00 | 3 | 14 | 0 | 33 | 0 |
| adm3 | 4760 | 0.00 | NA | NA | 0 | 0 | 0 |
| management | 1573 | 0.67 | 5 | 47 | 0 | 7 | 0 |
| adm1 | 0 | 1.00 | 4 | 5 | 0 | 4 | 0 |
| New Georeferenced Column | 0 | 1.00 | 16 | 35 | 0 | 4760 | 0 |
| lat_lon_deg | 0 | 1.00 | 13 | 32 | 0 | 4760 | 0 |
| public_data_source | 0 | 1.00 | 84 | 102 | 0 | 2 | 0 |
| converted | 0 | 1.00 | 53 | 53 | 0 | 1 | 0 |
| created_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| updated_timestamp | 0 | 1.00 | 22 | 22 | 0 | 2 | 0 |
| Geometry | 0 | 1.00 | 33 | 37 | 0 | 4760 | 0 |
| ADM2_EN | 0 | 1.00 | 3 | 14 | 0 | 30 | 0 |
| ADM2_PCODE | 0 | 1.00 | 8 | 8 | 0 | 30 | 0 |
| ADM1_EN | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| ADM1_PCODE | 0 | 1.00 | 5 | 5 | 0 | 1 | 0 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| rehab_year | 4760 | 0 | NaN | : |
| rehabilitator | 4760 | 0 | NaN | : |
| is_urban | 0 | 1 | 0.39 | FAL: 2884, TRU: 1876 |
| latest_record | 0 | 1 | 1.00 | TRU: 4760 |
| status | 0 | 1 | 0.56 | TRU: 2642, FAL: 2118 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| row_id | 0 | 1.00 | 68550.48 | 10216.94 | 49601.00 | 66874.75 | 68244.50 | 69562.25 | 471319.00 | ▇▁▁▁▁ |
| lat_deg | 0 | 1.00 | 7.68 | 0.22 | 7.06 | 7.51 | 7.71 | 7.88 | 8.06 | ▁▂▇▇▇ |
| lon_deg | 0 | 1.00 | 4.54 | 0.21 | 4.08 | 4.36 | 4.56 | 4.71 | 5.06 | ▃▆▇▇▂ |
| install_year | 1144 | 0.76 | 2008.63 | 6.04 | 1917.00 | 2006.00 | 2010.00 | 2013.00 | 2015.00 | ▁▁▁▁▇ |
| fecal_coliform_value | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| distance_to_primary_road | 0 | 1.00 | 5021.53 | 5648.34 | 0.01 | 719.36 | 2972.78 | 7314.73 | 26909.86 | ▇▂▁▁▁ |
| distance_to_secondary_road | 0 | 1.00 | 3750.47 | 3938.63 | 0.15 | 460.90 | 2554.25 | 5791.94 | 19559.48 | ▇▃▁▁▁ |
| distance_to_tertiary_road | 0 | 1.00 | 1259.28 | 1680.04 | 0.02 | 121.25 | 521.77 | 1834.42 | 10966.27 | ▇▂▁▁▁ |
| distance_to_city | 0 | 1.00 | 16663.99 | 10960.82 | 53.05 | 7930.75 | 15030.41 | 24255.75 | 47934.34 | ▇▇▆▃▁ |
| distance_to_town | 0 | 1.00 | 16726.59 | 12452.65 | 30.00 | 6876.92 | 12204.53 | 27739.46 | 44020.64 | ▇▅▃▃▂ |
| rehab_priority | 2654 | 0.44 | 489.33 | 1658.81 | 0.00 | 7.00 | 91.50 | 376.25 | 29697.00 | ▇▁▁▁▁ |
| water_point_population | 4 | 1.00 | 513.58 | 1458.92 | 0.00 | 14.00 | 119.00 | 433.25 | 29697.00 | ▇▁▁▁▁ |
| local_population_1km | 4 | 1.00 | 2727.16 | 4189.46 | 0.00 | 176.00 | 1032.00 | 3717.00 | 36118.00 | ▇▁▁▁▁ |
| crucialness_score | 798 | 0.83 | 0.26 | 0.28 | 0.00 | 0.07 | 0.15 | 0.35 | 1.00 | ▇▃▁▁▁ |
| pressure_score | 798 | 0.83 | 1.46 | 4.16 | 0.00 | 0.12 | 0.41 | 1.24 | 93.69 | ▇▁▁▁▁ |
| usage_capacity | 0 | 1.00 | 560.74 | 338.46 | 300.00 | 300.00 | 300.00 | 1000.00 | 1000.00 | ▇▁▁▁▅ |
| days_since_report | 0 | 1.00 | 2692.69 | 41.92 | 1483.00 | 2688.00 | 2693.00 | 2700.00 | 4645.00 | ▁▇▁▁▁ |
| staleness_score | 0 | 1.00 | 42.80 | 0.58 | 23.13 | 42.70 | 42.79 | 42.86 | 62.66 | ▁▁▇▁▁ |
| location_id | 0 | 1.00 | 235865.49 | 6657.60 | 23741.00 | 230638.75 | 236199.50 | 240061.25 | 267454.00 | ▁▁▁▁▇ |
| cluster_size | 0 | 1.00 | 1.05 | 0.25 | 1.00 | 1.00 | 1.00 | 1.00 | 4.00 | ▇▁▁▁▁ |
| lat_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| lon_deg_original | 4760 | 0.00 | NaN | NA | NA | NA | NA | NA | NA | |
| count | 0 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▁▇▁▁ |
We will drop the variable install_year that has missing values and may not be useful in the logistic regression model later. We will be using the below variables.
Osun_wp_sf_clean <- Osun_wp_sf %>%
filter_at(vars(status,
distance_to_primary_road,
distance_to_secondary_road,
distance_to_tertiary_road,
distance_to_city,
distance_to_town,
water_point_population,
local_population_1km,
usage_capacity,
is_urban,
water_source_clean),
all_vars(!is.na(.))) %>%
mutate(usage_capacity = as.factor(usage_capacity))In the above code, we have also excluded missing values and recoded usage capacity to a categorical label.
Correlation Analysis
We select the required variables for plotting the correlation matrix and remove the geometry column.
Osun_wp <- Osun_wp_sf_clean %>%
select(c(7,35:39,42:43,46:47,57)) %>%
st_set_geometry(NULL)cluster_vars.cor = cor(Osun_wp[,2:7])
corrplot.mixed(cluster_vars.cor,
lower = 'ellipse',
upper = 'number',
diag = 'l',
tl.col = 'black'
)
None of the variables are highly correlated to any other variable, so we will be keeping all variables for the logistic regression model.
Building a logistic regression model
model <- glm(status ~ distance_to_primary_road +
distance_to_secondary_road +
distance_to_tertiary_road +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sf_clean,
family = binomial(link = 'logit'))We use blr_regress() of blorr to generate the model report in scientific literature reporting format.
# report(model)
blr_regress(model) Model Overview
------------------------------------------------------------------------
Data Set Resp Var Obs. Df. Model Df. Residual Convergence
------------------------------------------------------------------------
data status 4756 4755 4745 TRUE
------------------------------------------------------------------------
Response Summary
--------------------------------------------------------
Outcome Frequency Outcome Frequency
--------------------------------------------------------
0 2114 1 2642
--------------------------------------------------------
Maximum Likelihood Estimates
-----------------------------------------------------------------------------------------------
Parameter DF Estimate Std. Error z value Pr(>|z|)
-----------------------------------------------------------------------------------------------
(Intercept) 1 0.0981 0.0939 1.0441 0.2964
distance_to_primary_road 1 0.0000 0.0000 -2.2037 0.0275
distance_to_secondary_road 1 0.0000 0.0000 -0.6184 0.5363
distance_to_tertiary_road 1 1e-04 0.0000 3.5609 4e-04
distance_to_town 1 0.0000 0.0000 -3.5553 4e-04
is_urbanTRUE 1 -0.2145 0.0798 -2.6885 0.0072
usage_capacity1000 1 -0.6544 0.0693 -9.4488 0.0000
water_source_cleanProtected Shallow Well 1 0.4588 0.0852 5.3869 0.0000
water_source_cleanProtected Spring 1 1.3151 0.4374 3.0068 0.0026
water_point_population 1 -5e-04 0.0000 -11.5660 0.0000
local_population_1km 1 3e-04 0.0000 19.3293 0.0000
-----------------------------------------------------------------------------------------------
Association of Predicted Probabilities and Observed Responses
---------------------------------------------------------------
% Concordant 0.7299 Somers' D 0.4598
% Discordant 0.2701 Gamma 0.4598
% Tied 0.0000 Tau-a 0.2271
Pairs 5585188 c 0.7299
---------------------------------------------------------------
Variables with pvalue less than 0.05 are statistically significant at 95% confidence level. This leaves distance_to_secondary_road as an insignificant variable
For interpretation of logistic regression report:
Categorical variables: A positive value implies an above average correlation and a negative value implies a below average correlation, while the magnitude of the coefficient does not matter for categorical variables;
Continuous variables: a positive value implies a direct correlation and a negative value implies an inverse correlation, while the magnitude of the value gives the strength of the correlation.
blr_confusion_matrix(model,cutoff = 0.5)Confusion Matrix and Statistics
Reference
Prediction FALSE TRUE
0 1302 762
1 812 1880
Accuracy : 0.6690
No Information Rate : 0.4445
Kappa : 0.3283
McNemars's Test P-Value : 0.2168
Sensitivity : 0.7116
Specificity : 0.6159
Pos Pred Value : 0.6984
Neg Pred Value : 0.6308
Prevalence : 0.5555
Detection Rate : 0.3953
Detection Prevalence : 0.5660
Balanced Accuracy : 0.6637
Precision : 0.6984
Recall : 0.7116
'Positive' Class : 1
The validity of a cut-off is measured using sensitivity, specificity and accuracy.
Sensitivity: The % of correctly classified events out of all events = TP / (TP + FN)
Specificity: The % of correctly classified
Accuracy: The % of correctly classified events out of all events = (TP + TN) / (TP + FP + TN + FN)
From the output, we see that the model gives us an accuracy of 0.6739, which is a good start as it is better than guessing (0.5).
The sensitivity and specificity are 0.7207 and 0.6154 respectively. This shows that the true positives are slightly higher than the true negative prediction rates.
Building geographically weighted logistic regression models
Osun_wp_sp <- Osun_wp_sf_clean %>%
select(c(status,
distance_to_primary_road,
distance_to_tertiary_road,
distance_to_city,
distance_to_town,
water_point_population,
local_population_1km,
is_urban,
usage_capacity,
water_source_clean
)) %>%
as_Spatial()
Osun_wp_spclass : SpatialPointsDataFrame
features : 4756
extent : 182502.4, 290751, 340054.1, 450905.3 (xmin, xmax, ymin, ymax)
crs : +proj=tmerc +lat_0=4 +lon_0=8.5 +k=0.99975 +x_0=670553.98 +y_0=0 +a=6378249.145 +rf=293.465 +towgs84=-92,-93,122,0,0,0,0 +units=m +no_defs
variables : 10
names : status, distance_to_primary_road, distance_to_tertiary_road, distance_to_city, distance_to_town, water_point_population, local_population_1km, is_urban, usage_capacity, water_source_clean
min values : 0, 0.014461356813335, 0.017815121653488, 53.0461399623541, 30.0019777713073, 0, 0, 0, 1000, Borehole
max values : 1, 26909.8616132094, 10966.2705628969, 47934.343603562, 44020.6393368124, 29697, 36118, 1, 300, Protected Spring
Computing fixed bandwidth
bw.fixed <- bw.ggwr(status ~ distance_to_primary_road +
distance_to_tertiary_road +
distance_to_town +
is_urban +
usage_capacity +
water_source_clean +
water_point_population +
local_population_1km,
data = Osun_wp_sp,
family = 'binomial',
approach = 'AIC',
kernel = 'gaussian',
adaptive = FALSE,
longlat = FALSE
)bw.fixedUsing the bandwidth, we will model the geoweighted logistic regression model
gwlr.fixed <- ggwr.basic(status ~
distance_to_primary_road +
distance_to_tertiary_road +
distance_to_city +
distance_to_town +
water_point_population +
local_population_1km +
usage_capacity +
is_urban +
water_source_clean,
data = Osun_wp_sp,
bw = 2471.029,
family = "binomial",
kernel = "gaussian",
adaptive = FALSE,
longlat = FALSE) Iteration Log-Likelihood
=========================
0 -1954
1 -1675
2 -1527
3 -1443
4 -1407
5 -1407
Converting SDF into sf dataframe
To assess the performance of the gwLR, we will convert the SDF object in as data frame by using code below.
gwr.fixed <- as.data.frame(gwlr.fixed$SDF)Next, we will label yhat values greater or equal to 0.5 into 1 and else 0. The results of the logic comparison operation will be saved in to the field called most.
gwr.fixed <- gwr.fixed %>%
mutate(most = ifelse(
gwr.fixed$yhat >=0.5, T,F
))
gwr.fixed$y <- as.factor(gwr.fixed$y)
gwr.fixed$most <- as.factor(gwr.fixed$most)
CM <- confusionMatrix(data = gwr.fixed$most, reference = gwr.fixed$y)
CMConfusion Matrix and Statistics
Reference
Prediction FALSE TRUE
FALSE 1833 268
TRUE 281 2374
Accuracy : 0.8846
95% CI : (0.8751, 0.8935)
No Information Rate : 0.5555
P-Value [Acc > NIR] : <2e-16
Kappa : 0.7661
Mcnemar's Test P-Value : 0.6085
Sensitivity : 0.8671
Specificity : 0.8986
Pos Pred Value : 0.8724
Neg Pred Value : 0.8942
Prevalence : 0.4445
Detection Rate : 0.3854
Detection Prevalence : 0.4418
Balanced Accuracy : 0.8828
'Positive' Class : FALSE
Osun_wp_sf_selected <- Osun_wp_sf_clean %>%
select(c(ADM2_EN, ADM2_PCODE, ADM1_EN, ADM1_PCODE, status))gwr_sf.fixed <- cbind(Osun_wp_sf_selected, gwr.fixed)Visualising coefficient estimates
tmap_mode("view")tmap mode set to interactive viewing
prob_T <- tm_shape(Osun) +
tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf.fixed) +
tm_dots(col = 'yhat',
border.col = 'gray60',
border.lwd = 1) +
tm_view(set.zoom.limits = c(8,14))
prob_T